Covid-19 Country-wise Data Analysis

A supplement to the manuscript:

An Epidemics Model for Non-First-Order Transmission Kinetics
Eun-Young Mun$^{*1}$ and Feng Geng$^{2}$
$^{1}$ Department of Health Behavior and Health Systems, School of Public Health, University of North Texas Health Science Center, Fort Worth, TX, USA
$^{2}$ School of Professional Studies, Northwestern University, Chicago, IL, USA

Background

Data analytics is a powerful tool for extracting knowledge from data. Few bodies of knowledge are more vital to human welfare than that of a pandemic-causing virus. How does such a virus spread? What makes it spread quickly or slowly? Which measures are most effective at suppressing it? The answers to these questions are prerequisites for any knowledge-based action in combating a pandemic.
As of September 2020, the Covid-19 pandemic has been running for six months and shows no sign of abating in the US. The devastating impact of pandemics illustrates the importance of developing the capability to quickly gain knowledge of any emerging infectious agent. Meanwhile, the real data collected on this specific coronavirus is a valuable resource for developing and practicing that capability, and for building knowledge to guide policy making in combating the ongoing pandemic.

Aim

This project is aimed at analyzing real Covid-19 data from multiple countries, and creating models that relate the disease growth rate to demographic characteristics and the disease control measures taken by each country. From these findings, policy recommendations and action plans will be proposed.

Original Data

The Covid-19 data used in this project is made available by Our World in Data (ourworldindata.org) at https://github.com/owid/covid-19-data/tree/master/public/data.

The data downloaded for this project spans January 1st to June 30th, 2020. The CSV file contains one row per location per date. The locations are in most cases countries, along with some independently administered regions, such as Taiwan. The variables cover confirmed cases, deaths, and testing, as well as other variables of potential interest.

Since June 30th, new variables have been added to the dataset. For the convenience of readers, the dataset downloaded on June 30th is attached as 'owid-covid-data.csv'. Note that to run the code posted here, the path must be changed accordingly.

In [101]:
import os
import numpy as np
import pandas as pd
from collections import Counter  
import math
import statistics as stat

import missingno as msno
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer as Impute
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor

import statsmodels.api as sm
from statsmodels.formula.api import glm

import matplotlib.pyplot as plt 
from matplotlib.ticker import MaxNLocator
import seaborn as sns
In [102]:
from datetime import datetime
date_str = datetime.now().strftime('%Y-%m-%d')
In [103]:
pd.set_option('display.max_column', None)
pd.set_option('display.max_row', None)
pd.set_option('display.max_seq_items', None)

Data processing

In [104]:
path = "C:\\Users\\80136730\\DataScience\\Covid19\\Original data"
file = 'owid-covid-data.csv'
data = pd.read_csv(os.path.join(path, file))
In [105]:
# quick survey of features
data.columns
Out[105]:
Index(['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases',
       'total_deaths', 'new_deaths', 'total_cases_per_million',
       'new_cases_per_million', 'total_deaths_per_million',
       'new_deaths_per_million', 'total_tests', 'new_tests',
       'total_tests_per_thousand', 'new_tests_per_thousand',
       'new_tests_smoothed', 'new_tests_smoothed_per_thousand', 'tests_units',
       'stringency_index', 'population', 'population_density', 'median_age',
       'aged_65_older', 'aged_70_older', 'gdp_per_capita', 'extreme_poverty',
       'cvd_death_rate', 'diabetes_prevalence', 'female_smokers',
       'male_smokers', 'handwashing_facilities', 'hospital_beds_per_thousand',
       'life_expectancy'],
      dtype='object')
In [106]:
# shorten some feature names
data.columns = ['loc_code', 'continent', 'location', 'date', 'total_cases', 'new_cases',
       'total_deaths', 'new_deaths', 'total_cases_pmm',
       'new_cases_pmm', 'total_deaths_pmm',
       'new_deaths_pmm', 'total_tests', 'new_tests',
       'total_tests_pm', 'new_tests_pm',
       'new_tests_sm', 'new_tests_pmsm', 'tests_units',
       'stringency_index', 'population', 'pop_den', 'median_age',
       'older_65', 'older_70', 'gdp_pc', 'extreme_poverty',
       'cvd_death_rate', 'diabetes_prevalence', 'female_smokers',
       'male_smokers', 'handwashing_facilities', 'hospital_beds_pm',
       'life_expectancy']
In [107]:
# country names containing spaces, apostrophes, or hyphens (e.g. 'South Korea',
# "Cote d'Ivoire") would cause problems downstream; replace those characters with underscores
data.replace({'location': {" ":"_", "'":"_", "-":"_"}}, regex=True, inplace=True)
data.replace({'continent': {" ":"_", "'":"_", "-":"_"}}, regex=True, inplace=True)
In [108]:
# the analysis will use "new cases per million" (pmm); drop the raw-count and testing columns
df1 = data.drop(['total_cases', 'new_cases','total_deaths', 'new_deaths',
                 'total_tests', 'new_tests','total_tests_pm',  'new_tests_pm', 
                 'new_tests_sm', 'tests_units'], axis=1)
In [109]:
# remove aggregate rows and countries with fewer than one million people
df1.drop(df1[df1['location'].isin(['International','World'])].index, inplace=True)
df1.drop(df1[df1['population'] < 1000000].index, inplace=True)
In [110]:
# replace some negative numbers (due to data adjustments?) with zero
columns = ['total_cases_pmm', 'new_cases_pmm', 'total_deaths_pmm', 'new_deaths_pmm']
for col in columns:
    df1.loc[df1[col]<0, col] = 0
In [111]:
# convert 4 key variables to 7-day rolling averages
for column in columns:
    new_col=[]
    for row in sorted(set(df1['location'])):
        section = df1.loc[df1['location']==row, column]
        new_col.append(section.rolling(window=7).mean().to_list())
    df1['{}r'.format(column)] = [item for sublist in new_col for item in sublist]
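The per-location loop above builds the rolling averages by flattening lists, which relies on the rows being sorted by location. A `groupby`/`transform` call computes the same 7-day rolling mean while keeping the result aligned to the original index; a minimal sketch on made-up toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    'location': ['A'] * 8 + ['B'] * 8,
    'new_cases_pmm': list(range(8)) + list(range(0, 16, 2)),
})
# transform applies the rolling mean within each location and returns
# the result aligned to the original rows, so no manual flattening is needed
toy['new_cases_pmmr'] = (toy.groupby('location')['new_cases_pmm']
                            .transform(lambda s: s.rolling(window=7).mean()))
# rows 0-5 of each location are NaN; row 6 of 'A' is mean(0..6) = 3.0
```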
In [112]:
# now drop the original unsmoothed columns
df2 = df1.drop(columns = columns)
In [113]:
# there are many days with zero cases at the beginning; drop them
df3 = df2[df2['total_cases_pmmr']>=1].copy()
In [114]:
# make a time variable marking the days of epidemic for each country
days = [1]*df3.shape[0]
for i in range(1, df3.shape[0]):
    if df3.iloc[i]['location']==df3.iloc[i-1]['location']:
        days[i] = days[i-1] +1
    else:
        pass
        
df3['days'] = days
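The day counter above walks the frame row by row; pandas' `groupby().cumcount()` yields the same 1-based counter in one line, assuming (as in df3) that each location's rows are contiguous and date-ordered. A toy sketch:

```python
import pandas as pd

toy = pd.DataFrame({'location': ['A', 'A', 'A', 'B', 'B']})
# cumcount() numbers the rows 0, 1, 2, ... within each location;
# adding 1 gives the 1-based "days of epidemic" counter
toy['days'] = toy.groupby('location').cumcount() + 1
# toy['days'] → [1, 2, 3, 1, 2]
```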
In [115]:
# some locations are missing population density data
set(df3[df3['pop_den'].isna()].loc_code)
Out[115]:
{'SSD', 'SYR', 'TWN'}
In [116]:
# pop_den (persons per square km) for SSD, SYR and TWN, from www.worldometers.info
df3.loc[df3['loc_code']=='SSD', 'pop_den']=18
df3.loc[df3['loc_code']=='SYR', 'pop_den']=95
df3.loc[df3['loc_code']=='TWN', 'pop_den']=651

df3['newCase_den'] = df3['new_cases_pmmr']*df3['pop_den']  #the unit is cases per million square km
In [117]:
# make a two-week cumulative sum of new cases as an approximation of active cases
active = df3.groupby('location', as_index=False).rolling(14, min_periods=1).newCase_den.sum()
df3['active'] = active.reset_index(level=0, drop=True)
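The groupby-rolling-sum pattern above returns a series with an extra index level, hence the reset_index call. An equivalent per-location trailing sum via `transform` avoids the index bookkeeping; a toy sketch with a 2-row window (the notebook uses 14 days on the real data):

```python
import pandas as pd

toy = pd.DataFrame({'location': ['A'] * 3 + ['B'] * 3,
                    'newCase_den': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]})
# each row gets the sum of itself and the preceding rows in the
# window, computed within its own location group
toy['active'] = (toy.groupby('location')['newCase_den']
                    .transform(lambda s: s.rolling(2, min_periods=1).sum()))
# toy['active'] → [1.0, 3.0, 5.0, 10.0, 30.0, 50.0]
```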
In [118]:
# convert key variables to logarithm
df3['log $t$'] = np.log(df3.days+1)
df3['log_total_case'] = np.log(df3.total_cases_pmmr+1)
df3['log_new_case'] = np.log(df3.new_cases_pmmr+1)
df3['log_total_death']= np.log(df3['total_deaths_pmmr']+1)
df3['log_new_death']= np.log(df3.new_deaths_pmmr+1)
df3['log_pop']= np.log(df3['population'])
df3['log_pop_den']= np.log(df3['pop_den'])
df3['log [$I$]']= np.log(df3['active'])

# drop the old feature columns
df4 = df3.drop(columns=['date', 'log_total_death', 'log_new_death', 
                  'total_deaths_pmmr', 'new_deaths_pmmr', 'active'])

# drop countries with total_cases_pmmr of 2.4 or less at day 14
drop_list24 = df4[(df4['days']==14) & (df4['total_cases_pmmr']<=2.4)].loc_code.tolist()

df5 = df4[~df4['loc_code'].isin(drop_list24)].copy()

#df5.to_csv('df5 {date}.csv'.format(date=date_str),index=False)
In [119]:
# Next, truncate the data where ln t > 2.5 (~12 days) or ln t > 3.0 (~20 days)
df5_trunc25 = df5[df5['log $t$']<=2.5].copy()
df5_trunc30 = df5[df5['log $t$']<=3.0].copy()

# df5_trunc25.to_csv('df5_trunc25 {date}.csv'.format(date=date_str),index=False)
# df5_trunc30.to_csv('df5_trunc30 {date}.csv'.format(date=date_str),index=False)
In [120]:
# visualize the dataset

fig, axs = plt.subplots(2, 2, figsize=(9,6), dpi=300, sharex=True)

sns.lineplot(ax=axs[0,0], x='log $t$', y='log [$I$]', data=df5.loc[df5['continent']=='Africa'], 
           hue="loc_code", palette="colorblind",  legend=None, linewidth=0.5)
sns.scatterplot(ax=axs[0,0], x='log $t$', y='log [$I$]', data=df5.loc[df5['continent']=='Africa'], 
           hue="loc_code", palette="colorblind",  legend=None, s=10)
axs[0,0].legend(title='Africa', loc='upper left', title_fontsize='12',frameon=False)
axs[0,0].set_ylabel("ln[$I$]", fontsize='12')
axs[0,0].yaxis.set_major_locator(MaxNLocator(integer=True))

sns.lineplot(ax=axs[0,1], x='log $t$', y='log [$I$]', data=df5.loc[df5['continent']=='Asia'], 
           hue="loc_code", palette="colorblind",  legend=None, linewidth=0.5)
sns.scatterplot(ax=axs[0,1], x='log $t$', y='log [$I$]', data=df5.loc[df5['continent']=='Asia'], 
           hue="loc_code", palette="colorblind",  legend=None, s=10)
axs[0,1].legend(title='Asia', loc='upper left', title_fontsize='12',frameon=False)
axs[0,1].set_ylabel("")
axs[0,1].yaxis.set_major_locator(MaxNLocator(integer=True))

sns.lineplot(ax=axs[1,0], x='log $t$', y='log [$I$]', data=df5.loc[df5['continent'].isin(['Europe','Oceania'])], 
           hue="loc_code", palette="colorblind",  legend=None, linewidth=0.5)
sns.scatterplot(ax=axs[1,0], x='log $t$', y='log [$I$]', data=df5.loc[df5['continent'].isin(['Europe','Oceania'])], 
           hue="loc_code", palette="colorblind",  legend=None, s=10)
axs[1,0].legend(title='Europe & Oceania', loc='upper left', title_fontsize='12',frameon=False)
axs[1,0].set_ylabel("ln[$I$]", fontsize='12')
axs[1,0].set_xlabel("ln $t$", fontsize='12')
axs[1,0].yaxis.set_major_locator(MaxNLocator(integer=True))

sns.lineplot(ax=axs[1,1], x='log $t$', y='log [$I$]', data=df5.loc[df5['continent'].isin(['North_America','South_America'])], 
           hue="loc_code", palette="colorblind",  legend=None, linewidth=0.5)
sns.scatterplot(ax=axs[1,1], x='log $t$', y='log [$I$]', data=df5.loc[df5['continent'].isin(['North_America','South_America'])], 
           hue="loc_code", palette="colorblind",  legend=None, s=10)
axs[1,1].legend(title='Americas', loc='upper left', title_fontsize='12',frameon=False)
axs[1,1].set_ylabel("")
axs[1,1].set_xlabel("ln $t$", fontsize='12')
axs[1,1].yaxis.set_major_locator(MaxNLocator(integer=True));
In [121]:
sns.set(font='Arial')
sns.set_style('white')
fig, axs = plt.subplots(2, 2, figsize=(9, 6), dpi=300, sharex=True)

sns.lineplot(ax=axs[0,0], x='log $t$', y='log [$I$]', data=df5_trunc25.loc[df5_trunc25['continent']=='Africa'], 
           hue="loc_code", palette="colorblind", legend=None, linewidth=0.5)
sns.scatterplot(ax=axs[0,0], x='log $t$', y='log [$I$]', data=df5_trunc25.loc[df5_trunc25['continent']=='Africa'], 
           hue="loc_code", palette="colorblind", legend=None, s=10)
axs[0,0].legend(title='Africa', loc='upper left', title_fontsize='12',frameon=False)
axs[0,0].set_ylabel("ln[$I$]", fontsize='12')

sns.lineplot(ax=axs[0,1], x='log $t$', y='log [$I$]', data=df5_trunc25.loc[df5_trunc25['continent']=='Asia'], 
           hue="loc_code", palette="colorblind", legend=None, linewidth=0.5)
sns.scatterplot(ax=axs[0,1], x='log $t$', y='log [$I$]', data=df5_trunc25.loc[df5_trunc25['continent']=='Asia'], 
           hue="loc_code", palette="colorblind", legend=None, s=10)
axs[0,1].legend(title='Asia', loc='upper left', title_fontsize='12',frameon=False)
axs[0,1].set_ylabel("")
axs[0,1].yaxis.set_major_locator(MaxNLocator(integer=True))

sns.lineplot(ax=axs[1,0], x='log $t$', y='log [$I$]', data=df5_trunc25.loc[df5_trunc25['continent'].isin(['Europe','Oceania'])], 
           hue="loc_code", palette="colorblind", legend=None, linewidth=0.5)
sns.scatterplot(ax=axs[1,0], x='log $t$', y='log [$I$]', data=df5_trunc25.loc[df5_trunc25['continent'].isin(['Europe','Oceania'])], 
           hue="loc_code", palette="colorblind", legend=None, s=10)
axs[1,0].legend(title='Europe & Oceania', loc='upper left', title_fontsize='12',frameon=False)
axs[1,0].set_ylabel("ln[$I$]", fontsize='12')
axs[1,0].set_xlabel("ln $t$", fontsize='12')

sns.lineplot(ax=axs[1,1], x='log $t$', y='log [$I$]', data=df5_trunc25.loc[df5_trunc25['continent'].isin(['North_America','South_America'])], 
           hue="loc_code", palette="colorblind", legend=None, linewidth=0.5)
sns.scatterplot(ax=axs[1,1], x='log $t$', y='log [$I$]', data=df5_trunc25.loc[df5_trunc25['continent'].isin(['North_America','South_America'])], 
           hue="loc_code", palette="colorblind", legend=None, s=10)
axs[1,1].legend(title='Americas', loc='upper left', title_fontsize='12',frameon=False)
axs[1,1].set_ylabel("")
axs[1,1].set_xlabel("ln $t$", fontsize='12');

Missing data

In [122]:
plot = msno.matrix(df5_trunc25)
plt.gcf().subplots_adjust(top=0.35)
fig = plot.get_figure()
#fig.savefig('missing_acti.tif')
In [123]:
# quick look at the number of missing values, by columns
df5_trunc25.isna().sum().sort_values().tail(10) 
Out[123]:
median_age                  11
older_70                    22
gdp_pc                      22
stringency_index            45
hospital_beds_pm           158
female_smokers             275
male_smokers               297
extreme_poverty            444
handwashing_facilities     785
new_tests_pmsm            1033
dtype: int64
In [124]:
# impute missing data in the numeric independent variables:
# first set aside columns that are non-numeric or have too many missing values
X = df5_trunc25.drop(['loc_code', 'continent', 'location', 'new_tests_pmsm', 'handwashing_facilities', 'extreme_poverty'], axis=1)
In [125]:
X.isna().sum().sort_values()
Out[125]:
log [$I$]                0
newCase_den              0
days                     0
new_cases_pmmr           0
total_cases_pmmr         0
log_pop_den              0
log_new_case             0
log $t$                  0
log_pop                  0
pop_den                  0
population               0
log_total_case           0
cvd_death_rate          11
life_expectancy         11
older_65                11
median_age              11
diabetes_prevalence     11
gdp_pc                  22
older_70                22
stringency_index        45
hospital_beds_pm       158
female_smokers         275
male_smokers           297
dtype: int64
In [126]:
# use IterativeImputer to fill in the missing values
# (adding dummy codes could help the imputation, but it takes a long time)
imp = Impute(max_iter = 100, random_state = 123)
X_imputed = pd.DataFrame(imp.fit_transform(X), columns=X.columns, index=X.index)
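IterativeImputer models each feature with missing values as a regression on the other features and fills the gaps with the fitted predictions. A toy illustration on made-up data, where column b is (apart from the hole) exactly 10 times column a:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

toy = pd.DataFrame({'a': [1.0, 2.0, 4.0, 5.0],
                    'b': [10.0, 20.0, np.nan, 50.0]})
imp = IterativeImputer(max_iter=100, random_state=123)
# fit_transform returns a NumPy array; wrap it to restore the column names
filled = pd.DataFrame(imp.fit_transform(toy), columns=toy.columns)
# the hole at b[2] is filled near 40, consistent with b ≈ 10 * a
```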
In [127]:
# scale the imputed independent variables
scaler = MinMaxScaler(feature_range=(0, 100))
In [128]:
# don't scale pop_den, days, log t, or the dependent variables
X = X_imputed.drop(['pop_den','log $t$', 'days', 'log [$I$]'], axis=1)
In [129]:
scaler100 = scaler.fit(X)
array_X = scaler100.transform(X).round(2)
X_scaled = pd.DataFrame(array_X, columns=X.columns, index=X.index)
In [130]:
plt.figure(figsize=(15,5))
ax = sns.boxplot(data=X_scaled)
ax.set_xticklabels(ax.get_xticklabels(), rotation=-45, ha='left', fontsize=16)
ax.tick_params(axis='y', labelsize=16);
fig = ax.get_figure();
In [131]:
df6 = pd.concat([df5_trunc25[['loc_code', 'location', 'continent', 'days', 'log $t$', 'log [$I$]', 'new_tests_pmsm']], X_imputed['pop_den'], X_scaled], axis=1)
df6.shape
Out[131]:
(1533, 27)
In [132]:
#df6.to_csv('df6 {date}.csv'.format(date=date_str), index=False)

Linear regression to calculate n and k, two key coefficients in the mathematical model

$$\ln[I] = \frac{\ln[(1-n)\,k]}{1-n} + \frac{1}{1-n}\,\ln t$$
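From a fitted line ln[I] = a + b·ln t, the model parameters follow as n = 1 − 1/b and k = b·exp(a/b), which is what the cells below compute from the regression slope and intercept. A quick round-trip check with made-up values (n = 0.4, k = 3.0):

```python
import math

n_true, k_true = 0.4, 3.0
b = 1.0 / (1.0 - n_true)                                # slope of ln[I] vs ln t
a = math.log((1.0 - n_true) * k_true) / (1.0 - n_true)  # intercept
n_hat = 1.0 - 1.0 / b
k_hat = b * math.exp(a / b)
# n_hat ≈ 0.4, k_hat ≈ 3.0
```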

In [133]:
# linear regression log[I]~logt

coef = []
inte = []
r2 = []

tests = []
index_d = []
locCode = []
aic = []
bic = []

for loca in sorted(set(df6.loc_code)):
    subset = df6[df6.loc_code == loca]
    mi = subset['log [$I$]'].idxmax()
    y=subset['log [$I$]']
    X=subset['log $t$']
    X = sm.add_constant(X)
    ols = sm.OLS(y,X).fit()
    aic.append(ols.aic)
    bic.append(ols.bic)
    if len(ols.params.values)==2:
        coef.append(ols.params.values[1])
        inte.append(ols.params.values[0])
        r2.append(ols.rsquared)
        locCode.append(loca)
        index_d.append(mi)
        tests.append(subset['new_tests_pmsm'].mean())
    else:
        print(loca)
In [134]:
stat.mean(aic), stat.mean(bic)
Out[134]:
(-22.448570662970656, -21.667231558969323)
In [135]:
# collect the results
mydict = { 'loc_code':locCode, 'tests':tests, 'inte_ld':inte, 'coef_ld':coef, 'r2_ld':r2}
collect = pd.DataFrame(mydict, index=index_d)
In [136]:
# calculate n, m and ko
collect['$n$'] = collect['coef_ld'].apply(lambda x: (1-1/x))
collect['$m$'] = collect['$n$'].apply(lambda x: 1/x)

c = []
for i in range(collect.shape[0]):
    k = collect['coef_ld'].iloc[i]*(math.exp(collect['inte_ld'].iloc[i]/collect['coef_ld'].iloc[i]))
    c.append(k) 
collect['$k$'] = c

collect['$ko$'] = [999]*collect.shape[0]
for loca in collect['loc_code']:
    collect.loc[collect['loc_code']==loca, '$ko$'] = collect.loc[collect['loc_code']==loca, '$k$']/df6.loc[df6['loc_code']==loca, 'pop_den']
    
In [137]:
collect.head()
Out[137]:
loc_code tests inte_ld coef_ld r2_ld $n$ $m$ $k$ $ko$
87 AFG NaN 1.220372 1.742216 0.998878 0.426018 2.347317 3.510046 0.064497
188 ALB NaN 4.465838 1.384515 0.991902 0.277726 3.600677 34.845028 0.332266
25183 ARE NaN 0.525894 1.799161 0.990956 0.444185 2.251312 2.409983 0.021433
888 ARG NaN -0.288941 2.054539 0.996448 0.513273 1.948282 1.784995 0.110342
1057 ARM NaN 2.391424 2.486335 0.998232 0.597802 1.672796 6.505427 0.063202
In [138]:
collect.shape
Out[138]:
(140, 9)
In [139]:
# stats of key parameters

stat.mean(collect['$n$']), stat.stdev(collect['$n$']),stat.variance(collect['$n$']),
Out[139]:
(0.43321404103891775, 0.14118384745404972, 0.019932878781928377)
In [140]:
stat.mean(collect['$ko$']), stat.stdev(collect['$ko$']),stat.variance(collect['$ko$']),
Out[140]:
(0.15466956004203056, 0.372330382213497, 0.13862991351924878)
In [141]:
stat.mean(collect['$m$']), stat.stdev(collect['$m$']),stat.variance(collect['$m$']),
Out[141]:
(2.206956171166165, 4.35465261507451, 18.96299939797527)
In [142]:
# let's look at the extreme outliers
n_extreme = collect[(collect['$n$']>(stat.mean(collect['$n$'])+3*stat.stdev(collect['$n$'])))|(collect['$n$']<(stat.mean(collect['$n$'])-3*stat.stdev(collect['$n$'])))].index
ko_extreme = collect[(collect['$ko$']>(stat.mean(collect['$ko$'])+3*stat.stdev(collect['$ko$'])))|(collect['$ko$']<(stat.mean(collect['$ko$'])-3*stat.stdev(collect['$ko$'])))].index

n_extreme, ko_extreme
Out[142]:
(Int64Index([16361, 18766], dtype='int64'), Int64Index([7949], dtype='int64'))
In [143]:
collect.loc[[16361, 18766, 7949]]
Out[143]:
loc_code tests inte_ld coef_ld r2_ld $n$ $m$ $k$ $ko$
16361 MNG NaN -0.555697 0.934282 0.947490 -0.070340 -14.216593 0.515426 0.260316
18766 PSE NaN 6.123839 0.976655 0.982354 -0.023903 -41.835435 516.252864 0.663392
7949 EST 0.355545 5.809186 1.269726 0.976809 0.212428 4.707470 123.217473 3.970531
In [144]:
collect.drop([16361, 18766, 7949]).mean(axis=0)
Out[144]:
tests       0.072455
inte_ld     2.008159
coef_ld     1.891331
r2_ld       0.990661
$n$         0.441838
$m$         2.630061
$k$        14.717428
$ko$        0.122332
dtype: float64
In [145]:
# visualize the results
from matplotlib.gridspec import GridSpec
from scipy import stats
In [146]:
x=collect['$ko$']
y=collect['$n$']
                
sns.set(font='Arial')
sns.set_style('ticks')
fig = plt.figure(figsize=(5,5))

gs = GridSpec(12,18)
gs.update(wspace=0.05, hspace=0.05)

ax_joint = fig.add_subplot(gs[4:12, 0:14])
ax_marg_x = fig.add_subplot(gs[0:4,0:14])
ax_marg_y = fig.add_subplot(gs[4:12, 14:18])

ax_joint.scatter(x,y, alpha=0.6 )

ax_marg_x.hist(x, bins=60)
kde_x = stats.gaussian_kde(x)
space_x = np.linspace(min(x),max(x),15)
ax_marg_x.plot(space_x, kde_x(space_x), linewidth=1)

ax_marg_y.hist(y,bins=40, orientation="horizontal")
kde_y = stats.gaussian_kde(y)
space_y = np.linspace(min(y),max(y),20)
ax_marg_y.plot( kde_y(space_y), space_y, linewidth=1)

ax_joint.set_xlim((0, 1.5))
ax_joint.set_ylim((0, 0.8))
ax_marg_x.set_xlim((0, 1.5))
ax_marg_y.set_ylim((0, 0.8))

ax_marg_x.axis('off')
ax_marg_y.axis('off')

# Set labels on joint
ax_joint.set_ylabel('$n$', rotation=0, fontsize=16, labelpad=10)
ax_joint.set_xlabel('$k_o$', fontsize=16)
Out[146]:
Text(0.5, 0, '$k_o$')
In [147]:
# linear regression log[I] ~ t (in days) to extract r2_d for comparison

coef = []
inte = []
r2d = []
location = []
aic = []
bic = []
for loca in sorted(set(df6.loc_code)):
    subset = df6[df6.loc_code == loca]
    mi = subset['log [$I$]'].idxmax()
    y=subset['log [$I$]']
    X=subset['days']
    X = sm.add_constant(X)
    ols = sm.OLS(y,X).fit()
    aic.append(ols.aic)
    bic.append(ols.bic)
    if len(ols.params.values)==2:
        coef.append(ols.params.values[1])
        inte.append(ols.params.values[0])
        r2d.append(ols.rsquared)
        location.append(loca)
    else:
        print(loca)
In [148]:
stat.mean(aic), stat.mean(bic)
Out[148]:
(4.201124838998407, 4.982463942999741)
In [149]:
collect['r2_d'] = r2d
In [150]:
collect.head()
Out[150]:
loc_code tests inte_ld coef_ld r2_ld $n$ $m$ $k$ $ko$ r2_d
87 AFG NaN 1.220372 1.742216 0.998878 0.426018 2.347317 3.510046 0.064497 0.931892
188 ALB NaN 4.465838 1.384515 0.991902 0.277726 3.600677 34.845028 0.332266 0.896735
25183 ARE NaN 0.525894 1.799161 0.990956 0.444185 2.251312 2.409983 0.021433 0.968232
888 ARG NaN -0.288941 2.054539 0.996448 0.513273 1.948282 1.784995 0.110342 0.965581
1057 ARM NaN 2.391424 2.486335 0.998232 0.597802 1.672796 6.505427 0.063202 0.937885
In [151]:
# compare the r^2 values of the two models
r_sq = collect[['r2_ld','r2_d']].copy()
r_sq.rename(columns={'r2_ld': 'ln[$I$] ~ ln $t$', 'r2_d': 'ln[$I$] ~ $t$'}, inplace=True)

r_sq_melt = pd.melt(r_sq)
r_sq_melt.rename(columns={'variable': 'Regression', 'value': '$R^2$'}, inplace=True)
#r_sq_melt.to_csv('r_sq_melt {d}.csv'.format(d=date_str))

plt.figure(figsize=(6,6))
ax= sns.histplot(r_sq_melt, x='$R^2$', hue='Regression', bins=50, kde=True, palette="tab10")

Model country-level "disease characteristic" parameters against socioeconomic and demographic data

In [152]:
# extract demographic and socioeconomic data
demogr = ['loc_code', 'location', 'continent', 'pop_den','median_age', 'life_expectancy', 'older_65','older_70', 
          'gdp_pc', 'cvd_death_rate','diabetes_prevalence', 'female_smokers', 'male_smokers','hospital_beds_pm']
demo_data = df6[demogr].loc[index_d].copy() 

# combine it with disease characteristic parameters
df7 = pd.concat([collect, demo_data], axis=1)

#df7.to_csv('df7 {date}.csv'.format(date=date_str), index=False)
In [153]:
# The data correlation heatmap without the case numbers

Var_Corr = round(df7.corr(), 2)
dropSelf = np.zeros_like(Var_Corr)
dropSelf[np.triu_indices_from(dropSelf)] = True

plt.figure(figsize = (12,8))
g = sns.heatmap(Var_Corr, xticklabels=Var_Corr.columns, yticklabels=Var_Corr.columns,cmap='coolwarm', vmax=0.8, 
                center=0,annot=True, mask=dropSelf)
g.set_xticklabels(g.get_xticklabels(), rotation=-45, ha='left');
In [154]:
df8 = df7.drop(columns=['life_expectancy', 'older_65', 'older_70'])

Var_Corr = round(df8.corr(), 2)
dropSelf = np.zeros_like(Var_Corr)
dropSelf[np.triu_indices_from(dropSelf)] = True

plt.figure(figsize = (12,8))
g = sns.heatmap(Var_Corr, xticklabels=Var_Corr.columns, yticklabels=Var_Corr.columns,cmap='coolwarm', vmax=0.8, 
                center=0,annot=True, mask=dropSelf)
g.set_xticklabels(g.get_xticklabels(), rotation=-45, ha='left');

Random Forest Model: n and ko as target variables

In [155]:
df9 = pd.concat([df8, pd.get_dummies(df8['continent'], prefix='con')],axis=1).dropna()
# this is one-hot encoding, now drop one of the continents to make it a dummy encoding
# and drop the original 'continent' column
df9.drop(['continent', 'con_Europe'],axis=1, inplace=True)
In [156]:
X9 = df9[['pop_den', 'median_age', 'gdp_pc',
       'cvd_death_rate', 'diabetes_prevalence', 'female_smokers',
       'male_smokers', 'hospital_beds_pm', 'tests', 'con_Africa', 'con_Asia',
       'con_North_America', 'con_Oceania', 'con_South_America']].copy()

y9 = df9[['$n$','$ko$']].copy()
In [157]:
# Modeling for n
In [158]:
# Screen for the optimal n_estimators and max_features settings.

ensemble_models = [
    ("RandomForestRegressor, max_features='sqrt'",
        RandomForestRegressor(n_estimators=100,
                               warm_start=True, oob_score=True,
                               max_features="sqrt",
                               random_state=123)),
    ("RandomForestRegressor, max_features='auto'",
            RandomForestRegressor(n_estimators=100,
                                   warm_start=True, oob_score=True,
                                   max_features="auto",
                                   random_state=123))]

# Map model names to a list of (<n_estimators>, <error rate>) pairs.
from collections import OrderedDict
error_rate = OrderedDict((label, []) for label, _ in ensemble_models)

# Set the range of `n_estimators` values to explore.
min_estimators = 20
max_estimators = 300

for label, model in ensemble_models:
    for i in range(min_estimators, max_estimators + 1):
        model.set_params(n_estimators=i)
        model.fit(X9, y9['$n$'])

        # Record the OOB error for each `n_estimators=i` setting.
        oob_error = 1 - model.oob_score_
        error_rate[label].append((i, oob_error))

# Generate the "OOB error rate" vs. "n_estimators" plot.
for label, model_err in error_rate.items():
    xs, ys = zip(*model_err)
    plt.plot(xs, ys, label=label)

plt.xlim(min_estimators, max_estimators)
plt.xlabel("Number of trees")
plt.ylabel("OOB error rate")
plt.legend(loc="best")
Out[158]:
<matplotlib.legend.Legend at 0xf21ef88>
In [159]:
rf_1 = RandomForestRegressor(n_estimators= 20, warm_start=True, max_features='sqrt', 
                           oob_score=True, random_state=123)
rf_1.fit(X9, y9['$n$'])
print("oob_score:", rf_1.oob_score_)
print("    score:", rf_1.score(X9, y9['$n$']))

### Importance analysis
importances = list(rf_1.feature_importances_)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(list(X9.columns), importances)]
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
    
importance_df = pd.DataFrame(feature_importances)
importance_df.columns = ['features', 'importance']
#importance_df.to_csv('importance_n.csv')

plt.figure(figsize = (12,5))
g = sns.barplot(x="features", y="importance",  data=importance_df[importance_df['importance']>0.0001])
g.set_xticklabels(g.get_xticklabels(), rotation=-45, ha='left')
g.tick_params(labelsize=15)
g.set_xlabel('Features', fontsize=20)
g.set_ylabel('Importance', fontsize=20)

fig=g.get_figure()
fig.subplots_adjust(bottom=0.45)
oob_score: 0.18230716523859136
    score: 0.8953543542720892
In [160]:
# Modeling for ko
In [161]:
# Screen for the optimal n_estimators and max_features settings.
ensemble_models = [
    ("RandomForestRegressor, max_features='sqrt'",
        RandomForestRegressor(n_estimators=100,
                               warm_start=True, oob_score=True,
                               max_features="sqrt",
                               random_state=123)),
    ("RandomForestRegressor, max_features='auto'",
            RandomForestRegressor(n_estimators=100,
                                   warm_start=True, oob_score=True,
                                   max_features="auto",
                                   random_state=123))]

# Map model names to a list of (<n_estimators>, <error rate>) pairs.
from collections import OrderedDict
error_rate = OrderedDict((label, []) for label, _ in ensemble_models)

# Set the range of `n_estimators` values to explore.
min_estimators = 20
max_estimators = 300

for label, model in ensemble_models:
    for i in range(min_estimators, max_estimators + 1):
        model.set_params(n_estimators=i)
        model.fit(X9, np.log(y9['$ko$']))

        # Record the OOB error for each `n_estimators=i` setting.
        oob_error = 1 - model.oob_score_
        error_rate[label].append((i, oob_error))

# Generate the "OOB error rate" vs. "n_estimators" plot.
for label, model_err in error_rate.items():
    xs, ys = zip(*model_err)
    plt.plot(xs, ys, label=label)

plt.xlim(min_estimators, max_estimators)
plt.xlabel("Number of trees")
plt.ylabel("OOB error rate")
plt.legend(loc="best")
Out[161]:
<matplotlib.legend.Legend at 0xc3f5388>
In [162]:
rf_1 = RandomForestRegressor(n_estimators= 75, warm_start=True, max_features='auto', 
                           oob_score=True, random_state=123)
rf_1.fit(X9, np.log(y9['$ko$']))
print("oob_score:", rf_1.oob_score_)
print("    score:", rf_1.score(X9, np.log(y9['$ko$'])))

### Importance analysis
importances = list(rf_1.feature_importances_)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(list(X9.columns), importances)]
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
    
importance_df = pd.DataFrame(feature_importances)
importance_df.columns = ['features', 'importance']
#importance_df.to_csv('importance_ko.csv')

plt.figure(figsize = (12,5))
g = sns.barplot(x="features", y="importance",  data=importance_df[importance_df['importance']>0.000001])
g.set_xticklabels(g.get_xticklabels(), rotation=-45, ha='left')
g.tick_params(labelsize=15)
g.set_xlabel('Features', fontsize=20)
g.set_ylabel('Importance', fontsize=20)

fig=g.get_figure()
fig.subplots_adjust(bottom=0.45)
oob_score: 0.580750043957389
    score: 0.9442909074141708
In [163]:
imp_ko = importance_df.sort_values(by=['importance'], ascending=False)